66 research outputs found
Data stream synchronisation for defining meaningful fMRI classification problems
Application of machine learning techniques to functional Magnetic Resonance Imaging (fMRI) data has recently been an active field of research. There is, however, one area which does not receive due attention in the literature: preparation of the fMRI data for subsequent modelling. In this study we focus on the issue of synchronisation of the stream of fMRI snapshots with the mental states of the subject, which is a form of smart filtering of the input data performed prior to building a predictive model. We demonstrate, investigate and thoroughly discuss the negative effects of a lack of alignment between the two streams, and propose an original data-driven approach to efficiently address this problem. Our solution involves casting the issue as a constrained optimisation problem, in combination with an alternative classification accuracy assessment scheme which is applicable to both batch and on-line scenarios and able to capture information distributed across a number of input samples, lifting the common simplifying i.i.d. assumption. The proposed method is tested using real fMRI data and experimentally compared to the state-of-the-art ensemble models reported in the literature, outperforming them by a wide margin.
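The abstract does not spell out the optimisation procedure, but the core idea of aligning the two streams can be sketched as a search over candidate lags, keeping the lag that maximises downstream classification accuracy. Everything below (the nearest-centroid classifier, the synthetic delayed signal, all variable names) is an illustrative assumption, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(X, y):
    # chronological split: fit class centroids on the first half,
    # report nearest-centroid accuracy on the second half
    n = len(y) // 2
    Xtr, ytr, Xte, yte = X[:n], y[:n], X[n:], y[n:]
    cents = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((Xte[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)
    return (pred == yte).mean()

def best_lag(X, y, max_lag=6):
    # pair snapshot X[t + lag] with label y[t] for each candidate lag
    scores = {}
    for lag in range(max_lag + 1):
        scores[lag] = score(X[lag:], y[:len(y) - lag])
    return max(scores, key=scores.get), scores

# synthetic stream: the mental state drives the signal 3 snapshots later,
# mimicking a haemodynamic-style delay (all numbers are illustrative)
T, true_lag = 400, 3
y = (rng.random(T) < 0.5).astype(int)
X = rng.normal(size=(T, 5))
X[true_lag:, 0] += 2.0 * y[:T - true_lag]   # delayed, class-dependent component

lag, scores = best_lag(X, y)
print(lag)   # the recovered lag matches the simulated delay
```

With a misaligned label stream the classifier performs near chance level, which is exactly the degradation the paper demonstrates on real fMRI data.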
On analysis of complex network dynamics - changes in local topology
Social networks created from data gathered in various computer systems are structures that constantly evolve. The nodes and their connections change because they are influenced by events external to the network. In this work we present a new approach to the description and quantification of patterns of complex dynamic social networks, illustrated with data from the Wroclaw University of Technology email dataset. We propose an approach based on the discovery of local network connection patterns (in this case triads of nodes), and we measure and analyse their transitions during network evolution. We define the Triad Transition Matrix (TTM), containing the probabilities of transitions between triads, and then show how it can help to discover the dynamic patterns of network evolution. One of the main issues when investigating a dynamical process is the selection of the time window size. Thus, the goal of this paper is also to investigate how the size of the time window influences the shape of the TTM and how the dynamics of triad counts change depending on the window size. We have shown that, although the link stability in the network is low, the dynamic network evolution pattern expressed by the TTMs is relatively stable, thus forming a background for fine-grained classification of complex network dynamics. Our results also open vast possibilities for link and structure prediction in dynamic networks. Future research and applications stemming from our approach are also proposed and discussed.
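The Triad Transition Matrix itself is straightforward to compute from consecutive network snapshots. The sketch below is a minimal illustration using undirected triads classified simply by their edge count (0-3); the paper works with a richer triad taxonomy, and the toy snapshots here are invented for the example:

```python
from itertools import combinations

def triad_class(triad, edges):
    # undirected triad state: number of edges among the three nodes (0-3)
    return sum(frozenset(p) in edges for p in combinations(triad, 2))

def ttm(snapshots, nodes, n_states=4):
    # row-stochastic Triad Transition Matrix from consecutive snapshots
    counts = [[0] * n_states for _ in range(n_states)]
    for before, after in zip(snapshots, snapshots[1:]):
        for triad in combinations(sorted(nodes), 3):
            counts[triad_class(triad, before)][triad_class(triad, after)] += 1
    mat = []
    for row in counts:
        total = sum(row)
        mat.append([c / total if total else 0.0 for c in row])
    return mat

e = lambda u, v: frozenset((u, v))
snaps = [
    {e(1, 2)},
    {e(1, 2), e(2, 3)},
    {e(1, 2), e(2, 3), e(1, 3)},
]
M = ttm(snaps, nodes=[1, 2, 3, 4])
print(M[2][3])   # the only two-edge triad closed into a triangle -> 1.0
```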
Physically inspired methods and development of data-driven predictive systems
Traditionally, building predictive models has been perceived as a combination of both science and art. Although the designer of a predictive system effectively follows a prescribed procedure, their domain knowledge as well as expertise and intuition in the field of machine learning are often irreplaceable. However, in many practical situations it is possible to build well-performing predictive systems by following a rigorous methodology, offsetting not only the lack of domain knowledge but also a partial lack of expertise and intuition with computational power. The generalised predictive model development cycle discussed in this thesis is an example of such a methodology which, despite being computationally expensive, has been successfully applied to real-world problems. The proposed predictive system design cycle is a purely data-driven approach. The quality of the data used to build the system is thus of crucial importance. In practice, however, the data is rarely perfect. Common problems include missing values, high dimensionality or a very limited amount of labelled exemplars. In order to address these issues, this work investigated and exploited inspirations coming from physics. The novel use of well-established physical models in the form of potential fields has resulted in the derivation of a comprehensive Electrostatic Field Classification Framework for supervised and semi-supervised learning from incomplete data.
Although computational power constantly becomes cheaper and more accessible, it is not infinite. Therefore, efficient techniques able to exploit the finite predictive information content of the data and limit the computational requirements of the resource-hungry predictive system design procedure are very desirable. In designing such techniques, this work once again investigated and exploited inspirations coming from physics. By using an analogy with a set of interacting particles and the resulting Information Theoretic Learning framework, the Density Preserving Sampling technique has been derived. This technique acts as a computationally efficient alternative to cross-validation, which fits well within the proposed methodology. All methods derived in this thesis have been thoroughly tested on a number of benchmark datasets. The proposed generalised predictive model design cycle has been successfully applied to two real-world environmental problems, in which a comparative study of Density Preserving Sampling and cross-validation has also been performed, confirming the great potential of the proposed methods.
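The thesis's Electrostatic Field Classification Framework is considerably richer (it also handles missing data and semi-supervised learning), but its core intuition, assigning a point to the class whose training samples exert the strongest Coulomb-like potential on it, can be sketched in a few lines. The class names and coordinates below are entirely made up for illustration:

```python
import math

def potential(point, charges, eps=1e-9):
    # Coulomb-like 1/r potential exerted at `point` by a set of unit charges
    return sum(1.0 / (math.dist(point, q) + eps) for q in charges)

def classify(point, labelled):
    # assign the class whose training points exert the strongest field
    return max(labelled, key=lambda cls: potential(point, labelled[cls]))

# made-up 2-D training data with two well-separated classes
train = {
    "clean":    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    "polluted": [(2.0, 2.0), (2.2, 1.9), (1.8, 2.1)],
}
print(classify((0.3, 0.2), train))   # pulled towards the "clean" cluster
print(classify((1.9, 2.2), train))   # pulled towards the "polluted" cluster
```

Because the 1/r potential decays smoothly with distance, every labelled point contributes to the decision, which is what makes the field analogy attractive for sparse or incomplete training data.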
Link Prediction Based on Subgraph Evolution in Dynamic Social Networks
We propose a new method for characterising the dynamics of complex networks, with an application to the link prediction problem. Our approach is based on the discovery of network subgraphs (in this study, triads of nodes) and measuring their transitions during network evolution. We define the Triad Transition Matrix (TTM), containing the probabilities of transitions between triads found in the network, and then show how it can help to discover and quantify the dynamic patterns of network evolution. We also propose the application of the TTM to link prediction, with an algorithm (called TTM-predictor) which shows good performance, especially for sparse networks analysed in short time scales. Future applications and research directions of our approach are also proposed and discussed.
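The TTM-predictor's exact scoring rule is defined in the paper; a heavily simplified stand-in for the same idea is to score a candidate link by how likely the triads containing it are to become denser in the next time window. The TTM values, graph and scoring rule below are all illustrative assumptions:

```python
from itertools import combinations

# assumed, precomputed Triad Transition Matrix over edge-count triad classes
# (row: edges now, column: edges in the next window); values are illustrative
TTM = [
    [0.90, 0.08, 0.015, 0.005],
    [0.20, 0.60, 0.15, 0.05],
    [0.05, 0.15, 0.50, 0.30],
    [0.01, 0.04, 0.15, 0.80],
]

def link_score(u, v, nodes, edges):
    # crude score for a missing link (u, v): the average probability that a
    # triad containing u and v becomes denser in the next time window
    probs = []
    for w in nodes:
        if w in (u, v):
            continue
        state = sum(frozenset(p) in edges for p in combinations((u, v, w), 2))
        probs.append(sum(TTM[state][state + 1:]))
    return sum(probs) / len(probs)

e = lambda a, b: frozenset((a, b))
nodes = [1, 2, 3, 4, 5]
edges = {e(1, 2), e(1, 3), e(2, 3), e(3, 4)}
# the well-embedded pair (1, 4) outranks the peripheral pair (4, 5)
print(link_score(1, 4, nodes, edges) > link_score(4, 5, nodes, edges))   # True
```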
Density Preserving Sampling: Robust and Efficient Alternative to Cross-validation for Error Estimation
Estimation of the generalization ability of a classification or regression model is an important issue, as it indicates the expected performance on previously unseen data and is also used for model selection. Currently used generalization error estimation procedures, such as cross-validation (CV) or bootstrap, are stochastic and thus require multiple repetitions to produce reliable results, which can be computationally expensive, if not prohibitive. The correntropy-inspired density-preserving sampling (DPS) procedure proposed in this paper eliminates the need for repeating the error estimation procedure by dividing the available data into subsets that are guaranteed to be representative of the input dataset. This allows the production of low-variance error estimates with an accuracy comparable to 10-times-repeated CV, at a fraction of the computations required by CV. The method can also be used for model ranking and selection. This paper derives the DPS procedure and investigates its usability and performance using a set of public benchmark datasets and standard classifiers.
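The paper derives DPS from correntropy and information-theoretic arguments; a much cruder stand-in with the same goal, folds whose density profile mirrors the full dataset, is to rank samples by a Parzen-window density estimate and deal them round-robin across folds. The data, bandwidth and fold count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def parzen_density(X, sigma=0.5):
    # Parzen-window (Gaussian kernel) density estimate at every sample
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1)

def dps_folds(X, k=2, sigma=0.5):
    # rank samples by estimated density and deal them round-robin,
    # so every fold covers both dense and sparse regions of the input
    order = np.argsort(parzen_density(X, sigma))
    return [order[i::k] for i in range(k)]

# bimodal toy data: one tight cluster around 0, one diffuse cluster around 4
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 1.0, (100, 2))])
folds = dps_folds(X, k=2)
in_tight = [int(np.sum(X[f][:, 0] < 2)) for f in folds]
print(in_tight)   # each fold holds roughly half of the tight cluster
```

Unlike random CV splits, a deterministic density-matched split needs no repetition, which is the source of the computational savings the abstract reports.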
Scaling beyond one rack and sizing of Hadoop platform
This paper focuses on two aspects of configuration choices for the Hadoop platform. First, we establish the performance implications of expanding an existing Hadoop cluster beyond a single rack. In the second part of the testing we focus on performance differences when deploying clusters of different sizes. The study also examines the disk latency constraints found on the test cluster during our experiments and discusses their impact on the overall performance. All testing approaches described in this work offer insight into the Hadoop environment for companies looking to either expand their existing Big Data analytics platform or implement it for the first time.
Probabilistic Approach to Structural Change Prediction in Evolving Social Networks
We propose a predictive model of structural changes in elementary subgraphs of a social network, based on a Mixture of Markov Chains. The model is trained and verified on a dataset from a large corporate social network analysed in short, one-day-long time windows, and reveals distinctive patterns of evolution of connections at the level of local network topology. We argue that a network investigated at such short timescales is highly dynamic and therefore immune to classic methods of link prediction and structural analysis, and show that in the case of complex networks, dynamic subgraph mining may lead to better prediction accuracy. The experiments were carried out on the logs from the Wroclaw University of Technology mail server.
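The single-chain building block of such a model, the maximum-likelihood transition matrix of an observed subgraph-state sequence, is easy to sketch; the paper's full mixture is fitted with EM and is not reproduced here. The state trajectories below are invented for illustration:

```python
from collections import Counter

def fit_markov_chain(sequences, n_states):
    # maximum-likelihood transition matrix from observed state sequences;
    # states never visited fall back to a uniform row
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    P = []
    for i in range(n_states):
        row = [counts[(i, j)] for j in range(n_states)]
        total = sum(row)
        P.append([c / total for c in row] if total else [1.0 / n_states] * n_states)
    return P

# invented daily trajectories of a subgraph's state (here: triad edge count 0-3)
seqs = [
    [0, 0, 1, 2, 2, 3],   # a triad gradually closing into a triangle
    [0, 1, 1, 1, 2],
    [1, 0, 0, 0],         # a connection that decays
]
P = fit_markov_chain(seqs, n_states=4)
print(P[1])   # transitions out of the one-edge state -> [0.2, 0.4, 0.4, 0.0]
```

A mixture model would maintain several such matrices plus mixing weights, letting different subgraphs follow qualitatively different evolution regimes.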
Robust predictive modelling of water pollution using biomarker data
This paper describes the methodology of building a predictive model for the purpose of marine pollution monitoring, based on low-quality biomarker data. A step-by-step, systematic data analysis approach is presented, resulting in the design of a purely data-driven model able to accurately discriminate between various coastal water pollution levels.
Environmental scientists often try to apply various machine learning techniques to their data without much success, mostly because of a lack of experience with different methods and of the required "under the hood" knowledge. This paper is therefore the result of a close collaboration between the machine learning and environmental science communities, presenting a predictive model development workflow as well as discussing and addressing potential pitfalls and difficulties. The novelty of the presented modelling approach lies in the successful application of machine learning techniques to high-dimensional, incomplete biomarker data, which to our knowledge has not been done before.
Towards cost-sensitive adaptation: when is it worth updating your predictive model?
Our digital universe is rapidly expanding: more and more daily activities are digitally recorded, data arrives in streams, needs to be analyzed in real time and may evolve over time. In the last decade many adaptive learning algorithms and prediction systems, which can automatically update themselves with new incoming data, have been developed. The majority of those algorithms focus on improving predictive performance and assume that a model update is always desired, as soon and as frequently as possible. In this study we consider a potential model update as an investment decision which, as in the financial markets, should be taken only if a certain return on investment is expected. We introduce and motivate a new research problem for data streams: cost-sensitive adaptation. We propose a reference framework for analyzing adaptation strategies in terms of costs and benefits. Our framework makes it possible to characterize and decompose the costs of model updates, and to assess and interpret the gains in performance due to model adaptation for a given learning algorithm on a given prediction task. Our proof-of-concept experiment demonstrates how the framework can aid in analyzing and managing adaptation decisions in the chemical industry.
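The investment framing boils down to a simple decision rule: adapt only when the expected benefit clears the update cost plus a required return. The sketch below, with wholly hypothetical numbers and a made-up `min_roi` threshold, illustrates the idea rather than the paper's actual framework:

```python
def should_update(expected_gain, update_cost, min_roi=0.2):
    # treat the update as an investment: adapt only if the expected benefit
    # exceeds the cost by the required return on investment (min_roi)
    return expected_gain >= update_cost * (1 + min_roi)

# wholly hypothetical numbers: retraining costs 100 units (compute, labelling,
# deployment risk); the expected error reduction is worth 150 or 110 units
print(should_update(expected_gain=150, update_cost=100))   # True
print(should_update(expected_gain=110, update_cost=100))   # False
```

In the second case the update would still improve accuracy, yet a cost-sensitive strategy skips it because the improvement does not justify the expense, which is precisely the trade-off the paper's framework makes explicit.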
- …